In this project, I explored online retail transactions to understand customer behavior, sales patterns, and opportunities for growth. Instead of presenting only numbers and charts, this notebook tells a story: every analysis is followed by insights and strategies that can directly guide business decisions.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
The Online Retail dataset from the UCI Machine Learning Repository contains ~542k transactions from a UK-based non-store retailer between Dec 2010 and Dec 2011. The company mainly sells unique all-occasion gifts, and many customers are wholesalers.
First, I loaded the dataset. To keep the analysis accurate, I removed rows with missing customer IDs and invalid transactions (non-positive quantities or unit prices), then engineered a `TotalPrice` column.
df = pd.read_excel("Online Retail.xlsx")
# Drop missing CustomerID and invalid transactions
df = df.dropna(subset=["CustomerID"])
df = df[(df["Quantity"]>0) & (df["UnitPrice"]>0)]
# Convert InvoiceDate to datetime
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
# Feature: TotalPrice
df["TotalPrice"] = df["Quantity"] * df["UnitPrice"]
# Get the latest date
snapshot_date = df["InvoiceDate"].max()
print("Cleaned dataset shape:", df.shape)
Cleaned dataset shape: (397884, 9)
After cleaning, we have ~398k transactions across 9 columns, ready for analysis.
df.head()
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | TotalPrice |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
Next, I asked: “Which products generate the most revenue?”
I aggregated total sales by product and visualized the top 10 best-sellers.
# Aggregate revenue by product and take the top 10
top_products = df.groupby("Description")["TotalPrice"].sum().nlargest(10)
# Horizontal bar chart, largest at the top
px.bar(x=top_products.values[::-1], y=top_products.index[::-1],
       orientation='h', title="Top 10 Products by Revenue").show()
We found that a small group of products, such as Paper Craft, Little Birdie and Regency Cakestand 3 Tier, contributed disproportionately to total revenue. This shows that the business relies heavily on a few key items.
Strategy: Highlight these key items in marketing campaigns and create premium bundles to increase basket value.
I then explored: “Where do most of our sales come from?”
# Aggregate revenue by country and take the top 10
top_countries = df.groupby("Country")["TotalPrice"].sum().nlargest(10)
px.bar(x=top_countries.values[::-1], y=top_countries.index[::-1],
       orientation='h', title="Top 10 Countries by Revenue").show()
The UK clearly dominated, contributing the majority of sales, while other countries played a much smaller role. This suggests that the UK is the main market driving business growth.
Strategy: Prioritize UK customers in marketing and logistics. Use smaller markets to test new promotions or localized campaigns.
Next question: “How does revenue change over the last 6 months?”
# Last 6 months
six_months_ago = snapshot_date - pd.DateOffset(months=6)
df_recent = df[df["InvoiceDate"] >= six_months_ago].copy()
df_recent["InvoiceMonth"] = df_recent["InvoiceDate"].dt.to_period("M").astype(str)
# Aggregate monthly sales
monthly_sales = df_recent.groupby("InvoiceMonth")["TotalPrice"].sum().reset_index()
monthly_sales = monthly_sales.sort_values("InvoiceMonth")
# Visualize trend
px.line(monthly_sales, x="InvoiceMonth", y="TotalPrice",
title="Monthly Sales Trend (Last 6 Months)", markers=True).show()
By aggregating monthly sales and visualizing the trend, I noticed fluctuations, with a noticeable dip in November. This could indicate seasonal variations in demand or potential stock shortages during that period.
Strategy: Plan inventory and promotions in advance to smooth out sales fluctuations and capture peak demand.
To answer “How can we group customers for tailored marketing?”, I used RFM (Recency, Frequency, Monetary) analysis.
# RFM calculation
rfm = df.groupby("CustomerID").agg({
"InvoiceDate": lambda x: (snapshot_date - x.max()).days,
"InvoiceNo": "nunique",
"TotalPrice": "sum"
}).rename(columns={"InvoiceDate":"Recency","InvoiceNo":"Frequency","TotalPrice":"Monetary"})
# Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm[["Recency","Frequency","Monetary"]])
from sklearn.cluster import KMeans
# Elbow method
inertia = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(rfm_scaled).inertia_ for k in range(1, 8)]
plt.plot(range(1,8), inertia, 'o-')
plt.xlabel("k"); plt.ylabel("Inertia"); plt.title("Elbow Method")
plt.show()
The elbow analysis suggests k = 3: inertia drops steeply up to three clusters and flattens afterwards.
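As a cross-check on the elbow heuristic, silhouette scores can confirm the cluster count. The sketch below uses synthetic blobs rather than the actual RFM table, so the data and the "correct" k here are illustrative only:

```python
# Cross-check of the elbow result with silhouette scores for k = 2..7.
# Synthetic data stands in for the scaled RFM features.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=42)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# The k with the highest silhouette score is the preferred cluster count
best_k = max(scores, key=scores.get)
print(best_k)
```

On the real data, replacing `X_demo` with `rfm_scaled` gives the same diagnostic; agreement between the elbow and silhouette methods strengthens the case for k = 3.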
# Train K-Means with the chosen k
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
rfm["Segment"] = kmeans.fit_predict(rfm_scaled)
# Segment Profile
rfm_profile = rfm.groupby("Segment").agg({"Recency":"mean","Frequency":"mean","Monetary":["mean","count"]}).round(2)
print(rfm_profile)
| Segment | Recency (mean) | Frequency (mean) | Monetary (mean) | Customers |
|---|---|---|---|---|
| 0 | 39.98 | 4.85 | 2012.11 | 3231 |
| 1 | 245.02 | 1.58 | 631.14 | 1093 |
| 2 | 6.14 | 80.21 | 122888.41 | 14 |
# 3D visualization
fig_rfm3d = px.scatter_3d(rfm, x='Recency', y='Frequency', z='Monetary',
color='Segment', size='Monetary', opacity=0.7,
title="3D RFM Segments")
fig_rfm3d.show()
Using K-Means clustering, we found three clear segments:

- Segment 0 — Loyal regulars (3,231 customers): bought recently (~40 days ago), purchase fairly often, and spend ~2,000 on average.
- Segment 1 — At-risk / lapsed (1,093 customers): last purchased ~8 months ago, with low frequency and spend.
- Segment 2 — VIP wholesalers (14 customers): extremely frequent, very recent buyers with enormous spend (~123k on average).

Strategy:

- Keep Segment 0 engaged with loyalty perks and personalized recommendations.
- Win back Segment 1 with reactivation emails and targeted discounts.
- Give Segment 2 dedicated account management and early access to new products.
I wanted to answer the question: “Can we predict which customers are likely to be high-value?”
To do this, I defined high-value customers as the top 25% in total spending and built a Random Forest model to predict them.
# Define target
q3 = df.groupby("CustomerID")["TotalPrice"].sum().quantile(0.75)
high_value_customers = df.groupby("CustomerID")["TotalPrice"].sum() > q3
# Features
customer_features = df.groupby("CustomerID").agg({
"InvoiceDate": lambda x: (snapshot_date - x.max()).days,
"InvoiceNo": "nunique",
"UnitPrice": "mean",
"Quantity": "mean",
"Country": lambda x: x.mode()[0]
}).rename(columns={"InvoiceDate":"Recency","InvoiceNo":"Frequency","UnitPrice":"AvgUnitPrice","Quantity":"AvgQuantity","Country":"Country"})
customer_features = pd.get_dummies(customer_features, columns=["Country"], drop_first=True)
X = customer_features
y = high_value_customers.astype(int)
# Scaling
X_scaled = scaler.fit_transform(X)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2,
random_state=42, stratify=y)
from sklearn.ensemble import RandomForestClassifier
# Train Random Forest
clf = RandomForestClassifier(n_estimators=200, random_state=42, class_weight='balanced')
clf.fit(X_train, y_train)
# Predict probabilities and classify high-value customers
y_proba = clf.predict_proba(X_test)[:,1]
y_pred = (y_proba >= 0.4).astype(int) # threshold 0.4 to capture more high-value customers
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix
print("ROC-AUC:", roc_auc_score(y_test, y_proba).round(3))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
ROC-AUC: 0.943

Confusion Matrix: [[614, 37], [55, 162]]

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.92 | 0.94 | 0.93 | 651 |
| 1 | 0.81 | 0.75 | 0.78 | 217 |
| accuracy | | | 0.89 | 868 |
| macro avg | 0.87 | 0.84 | 0.85 | 868 |
| weighted avg | 0.89 | 0.89 | 0.89 | 868 |
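The 0.4 cutoff used above deliberately trades some precision for higher recall. A toy sketch (with made-up scores and labels, not the model's actual predictions) shows the mechanics of that tradeoff:

```python
# Toy illustration of the precision/recall tradeoff behind a lowered
# classification threshold. Scores and labels are invented for the demo.
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_proba = np.array([0.1, 0.3, 0.45, 0.6, 0.42, 0.5, 0.7, 0.9])

results = {}
for thr in (0.5, 0.4):
    y_hat = (y_proba >= thr).astype(int)
    results[thr] = (precision_score(y_true, y_hat),
                    recall_score(y_true, y_hat))
    print(f"threshold={thr}: precision={results[thr][0]:.2f}, recall={results[thr][1]:.2f}")
```

Lowering the threshold from 0.5 to 0.4 flags more customers as high-value: recall rises (fewer missed high-value customers) at the cost of precision (more false alarms), which is the right trade when missing a valuable customer is costlier than an extra marketing contact.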
The Random Forest model achieved a ROC-AUC of 0.943, demonstrating strong predictive power. This means we can anticipate which customers are most likely to contribute significantly to revenue.
Strategy: Design personalized retention and upselling campaigns targeting these predicted high-value customers to maximize return on marketing investments.
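To understand *why* the model flags a customer as high-value, one can inspect the Random Forest's feature importances. The sketch below uses synthetic stand-in features (the names mirror the customer features above, but the data is invented, with the target built from `Frequency` so that it should dominate):

```python
# Sketch: reading feature importances from a fitted RandomForest.
# Synthetic stand-in data; the target depends only on Frequency.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "Recency":      rng.integers(0, 365, 500),
    "Frequency":    rng.integers(1, 50, 500),
    "AvgUnitPrice": rng.uniform(0.5, 20.0, 500),
})
y_demo = (X_demo["Frequency"] > 25).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)
# Importances sum to 1; higher means the feature drives more splits
importances = (pd.Series(model.feature_importances_, index=X_demo.columns)
               .sort_values(ascending=False))
print(importances)
```

On the real model, `pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)` gives the same view and would show which of Recency, Frequency, price, quantity, or country dummies the prediction leans on.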
Finally, I asked: “Which products are often bought together?” To answer this, I applied FP-Growth to UK transactions to mine frequent itemsets and association rules.
from mlxtend.frequent_patterns import fpgrowth, association_rules
# Pivot table & filter UK transactions
basket_sets = (df[df['Country'] == "United Kingdom"]
               .groupby(['InvoiceNo', 'Description'])['Quantity']
               .sum().unstack()                  # one row per invoice, one column per product
               .loc[:, lambda x: x.sum() > 10]   # keep products with more than 10 units sold overall
               > 0)                              # boolean: was the product in the basket?
# FP-Growth & association rules
frequent_itemsets = fpgrowth(basket_sets, min_support=0.01, use_colnames=True)
rules_top = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2) \
.sort_values('lift', ascending=False).head(10)
rules_top
| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 540 | (HERB MARKER THYME) | (HERB MARKER ROSEMARY) | 0.010753 | 0.010873 | 0.010153 | 0.944134 | 86.829038 | 1.0 | 0.010036 | 17.705365 | 0.999228 | 0.884817 | 0.943520 | 0.938918 |
| 541 | (HERB MARKER ROSEMARY) | (HERB MARKER THYME) | 0.010873 | 0.010753 | 0.010153 | 0.933702 | 86.829038 | 1.0 | 0.010036 | 14.921137 | 0.999350 | 0.884817 | 0.932981 | 0.938918 |
| 915 | (REGENCY TEA PLATE GREEN ) | (REGENCY TEA PLATE ROSES ) | 0.013637 | 0.015980 | 0.011534 | 0.845815 | 52.930211 | 1.0 | 0.011316 | 6.382074 | 0.994671 | 0.637874 | 0.843311 | 0.783810 |
| 914 | (REGENCY TEA PLATE ROSES ) | (REGENCY TEA PLATE GREEN ) | 0.015980 | 0.013637 | 0.011534 | 0.721805 | 52.930211 | 1.0 | 0.011316 | 3.545575 | 0.997040 | 0.637874 | 0.717958 | 0.783810 |
| 617 | (POPPY'S PLAYHOUSE LIVINGROOM ) | (POPPY'S PLAYHOUSE BEDROOM ) | 0.012556 | 0.015619 | 0.010153 | 0.808612 | 51.769856 | 1.0 | 0.009956 | 5.143389 | 0.993153 | 0.563333 | 0.805576 | 0.729306 |
| 616 | (POPPY'S PLAYHOUSE BEDROOM ) | (POPPY'S PLAYHOUSE LIVINGROOM ) | 0.015619 | 0.012556 | 0.010153 | 0.650000 | 51.769856 | 1.0 | 0.009956 | 2.821270 | 0.996244 | 0.563333 | 0.645550 | 0.729306 |
| 928 | (SET OF 3 WOODEN STOCKING DECORATION) | (SET OF 3 WOODEN TREE DECORATIONS) | 0.014959 | 0.013757 | 0.010333 | 0.690763 | 50.211536 | 1.0 | 0.010127 | 3.189279 | 0.994968 | 0.562092 | 0.686450 | 0.720927 |
| 929 | (SET OF 3 WOODEN TREE DECORATIONS) | (SET OF 3 WOODEN STOCKING DECORATION) | 0.013757 | 0.014959 | 0.010333 | 0.751092 | 50.211536 | 1.0 | 0.010127 | 3.957447 | 0.993755 | 0.562092 | 0.747312 | 0.720927 |
| 619 | (POPPY'S PLAYHOUSE KITCHEN) | (POPPY'S PLAYHOUSE LIVINGROOM ) | 0.017301 | 0.012556 | 0.010693 | 0.618056 | 49.225611 | 1.0 | 0.010476 | 2.585309 | 0.996934 | 0.557994 | 0.613199 | 0.734865 |
| 618 | (POPPY'S PLAYHOUSE LIVINGROOM ) | (POPPY'S PLAYHOUSE KITCHEN) | 0.012556 | 0.017301 | 0.010693 | 0.851675 | 49.225611 | 1.0 | 0.010476 | 6.625290 | 0.992142 | 0.557994 | 0.849063 | 0.734865 |
For example, customers who bought Herb Marker Rosemary often also purchased Herb Marker Thyme.
Strategy: Implement “Frequently Bought Together” recommendations and create bundled promotions to increase average order value.
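A “Frequently Bought Together” widget is a thin lookup over the rules table. The sketch below uses a hand-made stand-in DataFrame with the same columns `mlxtend.association_rules` produces; on the real output, `rules_top` would be passed in instead:

```python
# Sketch of a "Frequently Bought Together" lookup over association rules.
# rules_demo is a hand-made stand-in for mlxtend's association_rules output.
import pandas as pd

rules_demo = pd.DataFrame({
    "antecedents": [frozenset({"HERB MARKER THYME"}),
                    frozenset({"HERB MARKER ROSEMARY"})],
    "consequents": [frozenset({"HERB MARKER ROSEMARY"}),
                    frozenset({"HERB MARKER THYME"})],
    "confidence":  [0.94, 0.93],
    "lift":        [86.8, 86.8],
})

def recommend(product, rules, top_n=3):
    """Return up to top_n products most often bought with `product`."""
    hits = rules[rules["antecedents"].apply(lambda s: product in s)]
    hits = hits.sort_values("confidence", ascending=False).head(top_n)
    return [item for cons in hits["consequents"] for item in cons]

print(recommend("HERB MARKER THYME", rules_demo))  # ['HERB MARKER ROSEMARY']
```

Ranking by confidence answers “given this product is in the basket, what is most likely to join it?”; sorting by lift instead would favor the most *unusual* pairings.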
Through the analysis of nearly 398,000 cleaned transactions, we uncovered actionable insights:

- Revenue is concentrated in a handful of hero products and overwhelmingly in the UK market.
- Monthly sales fluctuate, so inventory and promotions should be planned around demand swings.
- RFM segmentation revealed three distinct customer groups, from a tiny set of VIP wholesalers to a sizable at-risk segment.
- A Random Forest model can flag likely high-value customers with strong accuracy (ROC-AUC 0.943).
- Association rules surfaced natural product pairings for cross-selling and bundles.
By turning analytics into clear strategies, this project demonstrates how data-driven decisions can boost revenue, improve customer loyalty, and drive sustainable growth in e-commerce.